
Project Name - PLAYSTORE APP REVIEW ANALYSIS

Project Type - EDA
Contribution - Individual

Project Summary -


Play Store data has enormous potential to drive app-making businesses to success. Developers can draw actionable insights from it to work on and capture the Android market. Each row in this dataset has values for category, rating, size and more. In this project the data is analysed to discover the key factors responsible for app engagement and success.

GitHub Link -


Problem Statement

To analyse the given data (Play Store App Review Analysis) and determine the parameters that drive an app's success.

To determine public sentiment: which kinds of apps win users over, and why?

Define Your Business Objective?

Our EDA on Play Store apps will help developers understand end-user sentiment, so that they can take appropriate decisions to motivate users to choose their app.

General Guidelines : -


  1. Well-structured, formatted, and commented code is required.

  2. Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

    The additional credits will have advantages over other students during Star Student selection.

        [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                  without a single error logged. ]
  3. Each and every logic should have proper comments.

  4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.

# Chart visualization code
  • Why did you pick the specific chart?
  • What is/are the insight(s) found from the chart?
  • Will the gained insights help creating a positive business impact? Are there any insights that lead to negative growth? Justify with specific reason.
  5. You have to create at least 20 logical & meaningful charts having important insights.

[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis ]

Let's Begin !

1. Know Your Data

Import Libraries

import numpy as np
import pandas as pd

Dataset Loading

from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Dataset First View

ps_ar = pd.read_csv('/content/drive/MyDrive/almabetter programin asignment/EDA-2- PLAY STORE APP REVIEW ANALYSIS/Play Store Data.csv')

ps_ar

Dataset Rows & Columns count

# Dataset Rows & Columns count
ps_ar.shape
(10841, 13)

Dataset Information

# Dataset Info
ps_ar.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

Duplicate Values

# Dataset Duplicate Value Count
ps_ar.drop_duplicates(inplace=True)
ps_ar.shape
# There were (10841 - 10358) = 483 duplicate rows and no duplicate columns; the duplicate rows are dropped.
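The duplicate count quoted in the comment can be obtained before anything is dropped. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative assumptions, not the Play Store data); the same calls should apply to `ps_ar`.

```python
import pandas as pd

# Toy frame standing in for ps_ar; the values are made up for illustration.
toy = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Category": ["TOOLS", "GAME", "GAME", "FAMILY"],
})

# Count fully duplicated rows before removing them.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, toy.shape, deduped.shape)
```

Resetting the index is optional; it keeps the row labels contiguous after rows are removed.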


Missing Values/Null Values

# Missing Values/Null Values Count
ps_ar.isna().sum()
App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

# Visualizing the missing values
ps_ar.info()  # Rating has 1465 null values after de-duplication (1474 in the raw data shown below); Type and Content Rating have 1 each, Current Ver has 8, Android Ver has 3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
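Since `info()` only prints counts, the missing values could also be drawn as a chart. This is a sketch under assumptions: `toy` is a made-up frame standing in for `ps_ar`, and a bar chart of null counts is one possible visualization, not necessarily the one the template intends.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame standing in for ps_ar; the values are made up for illustration.
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Free", None, "Free"],
    "App": ["A", "B", "C", "D"],
})

# Null count per column, keeping only the columns that have missing values.
null_counts = toy.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)

# One bar per column with missing data.
ax = null_counts.plot(kind="bar")
ax.set_xlabel("Column")
ax.set_ylabel("Missing values")
ax.set_title("Missing values per column")
plt.show()
```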

What did you know about your dataset?

After removing duplicates, the dataset has 10358 rows and 13 columns, with no duplicate rows or columns remaining. The Rating column has 1465 null values (1474 before the duplicates were removed), Type and Content Rating have 1 null value each, Current Ver has 8, and Android Ver has 3. The Price column appears to contain only 0 values.

2. Understanding Your Variables

# Dataset Columns
ps_ar.columns  # lists all column names
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')

# Dataset Describe
ps_ar.describe()  # summary statistics for the numeric columns

Variables Description

App - application name

Category - category of the application

Rating - star rating given by users

Reviews - number of user reviews

Size - application size

Installs - number of users who have installed the app worldwide

Type - type of application (free or paid)

Price - price of the app, if it is paid

Content Rating - age group the content is rated for

Genres - genre of the application

Last Updated - date the app was last updated

Current Ver - current version of the app

Android Ver - Android version

Check Unique Values for each variable.

ps_ar.Genres.unique()
array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity', 'Communication', 'Dating',
       'Education', 'Education;Creativity', 'Education;Education',
       'Education;Action & Adventure', 'Education;Pretend Play',
       'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Casual;Pretend Play', 'Strategy', 'Action', 'Puzzle', 'Sports',
       'Music', 'Word', 'Racing', 'Casual;Creativity', 'Simulation',
       'Adventure', 'Board', 'Trivia', 'Role Playing',
       'Action;Action & Adventure', 'Casual;Brain Games',
       'Simulation;Action & Adventure', 'Educational;Creativity',
       'Puzzle;Brain Games', 'Educational;Education', 'Card;Brain Games',
       'Educational;Brain Games', 'Educational;Pretend Play',
       'Casual;Action & Adventure', 'Entertainment;Education',
       'Casual;Education', 'Music;Music & Video', 'Arcade;Pretend Play',
       'Simulation;Pretend Play', 'Puzzle;Creativity',
       'Sports;Action & Adventure', 'Racing;Action & Adventure',
       'Educational;Action & Adventure', 'Arcade;Action & Adventure',
       'Entertainment;Action & Adventure', 'Puzzle;Action & Adventure',
       'Role Playing;Action & Adventure', 'Strategy;Action & Adventure',
       'Music & Audio;Music & Video', 'Health & Fitness;Education',
       'Adventure;Education', 'Board;Brain Games',
       'Board;Action & Adventure', 'Board;Pretend Play',
       'Casual;Music & Video', 'Education;Music & Video',
       'Role Playing;Pretend Play', 'Entertainment;Pretend Play',
       'Video Players & Editors;Creativity', 'Card;Action & Adventure',
       'Medical', 'Social', 'Shopping', 'Photography', 'Travel & Local',
       'Travel & Local;Action & Adventure', 'Tools', 'Personalization',
       'Productivity', 'Parenting', 'Parenting;Education',
       'Parenting;Brain Games', 'Parenting;Music & Video', 'Weather',
       'Video Players & Editors', 'News & Magazines', 'Maps & Navigation',
       'Health & Fitness;Action & Adventure', 'Educational', 'Casino',
       'Adventure;Brain Games', 'Video Players & Editors;Music & Video',
       'Trivia;Education', 'Lifestyle;Education',
       'Books & Reference;Creativity', 'Books & Reference;Education',
       'Simulation;Education', 'Puzzle;Education',
       'Role Playing;Education', 'Role Playing;Brain Games',
       'Strategy;Education', 'Racing;Pretend Play',
       'Communication;Creativity', 'Strategy;Creativity'], dtype=object)
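The cell above lists unique values for Genres only. A sketch of how the check might be extended to every column follows; `toy` is a made-up frame standing in for `ps_ar`, and splitting Genres on ';' is an assumption based on values such as 'Puzzle;Brain Games' in the output above.

```python
import pandas as pd

# Toy frame standing in for ps_ar; the values are made up for illustration.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Genres": ["Action", "Puzzle;Brain Games", "Tools"],
    "Type": ["Free", "Free", "Paid"],
})

# Number of distinct values in every column at once.
unique_counts = toy.nunique().sort_values(ascending=False)
print(unique_counts)

# Genres can hold several labels separated by ';'; keep only the first one.
primary_genre = toy["Genres"].str.split(";").str[0]
print(primary_genre.unique())
```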

3. Data Wrangling

Data Wrangling Code

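# Write your code to make your dataset analysis ready.
# Note: this replaces the string '0' in every column, not only in Price,
# which is likely why Reviews shows fewer non-null values further down.
ps_ar.replace('0',np.nan,inplace = True)
# Dropping the Price and Type columns: Price was treated as all 0 (now NaN) and Type as always 'Free'.
ps_ar.drop(['Price','Type'],axis=1, inplace=True)
ps_ar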



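# Rating is already float64; Reviews is stored as text, so convert it to numbers.
# errors='coerce' turns entries that cannot be parsed into NaN instead of raising an error.
ps_ar['Rating'] = ps_ar['Rating'].astype(float)
ps_ar['Reviews'] = pd.to_numeric(ps_ar['Reviews'], errors='coerce')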
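Installs is also stored as text (its dtype is object in the `info()` output), so it would need a similar conversion before it can be summed or averaged. The sketch below assumes the values look like '10,000+'; that format and the sample values are assumptions for illustration and should be checked against the actual column.

```python
import pandas as pd

# Made-up sample of Installs strings; the '10,000+' format is an assumption.
installs_raw = pd.Series(["10,000+", "500+", "1,000,000+", "Free"])

# Strip thousands separators and the trailing '+', then convert.
cleaned = installs_raw.str.replace(r"[,+]", "", regex=True)

# Entries that are still not numeric (such as 'Free') become NaN.
installs = pd.to_numeric(cleaned, errors="coerce")
print(installs)
```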




# Imputing NaN values in the Rating column with the column mean (1474 NaN values in the raw data).
ps_ar['Rating'] = ps_ar['Rating'].fillna(ps_ar['Rating'].mean())

# Replacing 'Varies with device' with NaN.
ps_ar.replace('Varies with device',np.nan,inplace = True)


# Dropping rows with NaN in Current Ver or Android Ver, since version strings cannot be meaningfully imputed.
# Besides the 8 and 3 originally missing values, this also drops every row that held 'Varies with device'.
ps_ar.dropna(subset=['Current Ver','Android Ver'], axis=0, inplace=True)

ps_ar.shape
(9299, 11)

ps_ar.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9299 entries, 0 to 10838
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             9299 non-null   object 
 1   Category        9299 non-null   object 
 2   Rating          9299 non-null   float64
 3   Reviews         8738 non-null   object 
 4   Size            9049 non-null   object 
 5   Installs        9299 non-null   object 
 6   Content Rating  9299 non-null   object 
 7   Genres          9299 non-null   object 
 8   Last Updated    9299 non-null   object 
 9   Current Ver     9299 non-null   object 
 10  Android Ver     9299 non-null   object 
dtypes: float64(1), object(10)
memory usage: 871.8+ KB

#Finding outliers using box plot for rating.
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x='Rating', data=ps_ar)

plt.show()
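A box plot flags points beyond 1.5 times the interquartile range from the quartiles. The same rule can be applied numerically to list the outlying ratings; this is a sketch on made-up numbers (the `ratings` values, including 19.0, are illustrative assumptions), not output from the dataset.

```python
import pandas as pd

# Made-up ratings for illustration, with one low and one impossible high value.
ratings = pd.Series([4.1, 4.3, 3.9, 4.5, 4.0, 1.0, 19.0])

# Interquartile range and the usual 1.5 * IQR fences used by box plots.
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the points a box plot draws individually.
outliers = ratings[(ratings < lower) | (ratings > upper)]
print(outliers)
```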


What all manipulations have you done and insights you found?


Manipulations-

  • The Price column was treated as containing only 0 values, so these were replaced with NaN and the column was dropped along with Type. Dropping Price directly, without the replacement, would give the same result for that column.

  • Cast Rating to float, since it is a number (pandas had already read it as float64).

  • Converted Reviews to a numeric type, because it is read in as text. Installs is also stored as text and needs the same conversion before it can be treated as a number.

  • Imputed the NaN values in the Rating column with the column mean. There were 1474 of them, too many to drop without hampering the analysis.

  • In Current Ver and Android Ver, 'Varies with device' was replaced with NaN and rows with NaN in either column were dropped. This removes the 8 and 3 originally missing values together with every 'Varies with device' row, taking the dataset from 10358 to 9299 rows.

Insights found-

  • The dataframe is now ready to be analysed on various parameters and in various ways.

  • Detailed insights follow from the analysis and are written up in the conclusion.

  • Outliers in Rating were identified with a box plot, so that unwanted data can be discarded before decisions are made.

4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables

Chart - 1

# Chart - 1 visualization code
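# Histogram of app categories (Category is categorical, so each bar is a frequency count;
# with bins=20, neighbouring categories share a bar if there are more than 20 categories)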
import matplotlib.pyplot as plt
import seaborn as sns
plt.hist(ps_ar['Category'], bins=20)
plt.xlabel('Category')
plt.xticks(rotation = 90)
plt.ylabel('Frequency')
plt.title('Category wise app analysis')
plt.show()

1. Why did you pick the specific chart?

To see which categories have the most apps. The histogram is used here for univariate analysis: it shows the frequency of each category value.


2. What is/are the insight(s) found from the chart?

  • The categories with the most apps are Family and Game.
  • The categories with the fewest apps are Parenting and Libraries_and_demo.
  • Tools and Personalization fall in the middle.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Yes. The insights give a clear idea of which categories offer more scope for business. For example, a gaming app appears more likely to succeed than an app launched in most other categories.

For any insights that lead to negative growth further investigation is needed.
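For a categorical column, counting the rows per category and drawing one bar per category avoids any dependence on the number of bins. This is a sketch of that alternative on a made-up frame (`toy` and its values are assumptions for illustration); it is not the chart the insights above were read from.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame standing in for ps_ar; the values are made up for illustration.
toy = pd.DataFrame(
    {"Category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "GAME", "FAMILY"]}
)

# value_counts returns one count per category, sorted from largest to smallest.
category_counts = toy["Category"].value_counts()

# One bar per category, so no two categories can be merged into one bin.
ax = category_counts.plot(kind="bar")
ax.set_xlabel("Category")
ax.set_ylabel("Number of apps")
ax.set_title("Category wise app analysis")
plt.show()
```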

Chart - 2

# Chart - 2 visualization code
#Bar graph for category and installs
x= ps_ar.loc[:,'Category']
y=ps_ar.loc[:,'Installs']
plt.bar(x,y)
plt.title('Category vs Installs')
plt.xlabel('Category')
plt.xticks(rotation =90)
plt.ylabel('Installs')
plt.show()


1. Why did you pick the specific chart?

To do bivariate analysis between app category and installs. A bar plot suits this comparison of a categorical variable with a quantity, and shows how many downloads each category has.

2. What is/are the insight(s) found from the chart?

Insights:

- The highest downloads are for News and Magazines, Personalization, Productivity, Travel and Local, Family, Medical, Social, Lifestyle, Finance, Business, and Art and Design, all with 100M+ downloads.

- The lowest downloads are for Entertainment.

- Shopping, Photography and Sports have equal downloads, more than Video Players, Weather and Parenting apps.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will help create a positive business impact.

No insights were found that may lead to negative growth.
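Passing the raw columns to `plt.bar` draws one bar per app at the same x position, so bars within a category overlap. A sketch of an aggregated alternative follows, assuming Installs has already been converted to numbers; `toy` and its values are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy frame standing in for ps_ar, with Installs already numeric (an assumption).
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY"],
    "Installs": [1000, 10000, 500, 100000],
})

# Total installs per category, largest first.
installs_by_category = (
    toy.groupby("Category")["Installs"].sum().sort_values(ascending=False)
)

# One bar per category instead of one bar per app.
plt.bar(installs_by_category.index, installs_by_category.values)
plt.title("Category vs Installs")
plt.xlabel("Category")
plt.xticks(rotation=90)
plt.ylabel("Total installs")
plt.show()
```

Whether the sum, the mean or the median is the right aggregate depends on the question being asked; the sum is used here only as an example.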

Chart - 3

# Chart - 3 visualization code
#Bar graph for category and ratings.
x=ps_ar.loc[:,'Category']
y=ps_ar.loc[:,'Rating']
plt.bar(x,y)
plt.title('Category vs Rating')
plt.xlabel('Category')
plt.xticks(rotation =90)
plt.ylabel('Rating')
plt.show()

1. Why did you pick the specific chart?

To do bivariate analysis between app category and rating. A bar plot suits this comparison and shows the ratings reached in each category.

2. What is/are the insight(s) found from the chart?

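The bars reach a rating of 5 in most cases; only 8 stay below 5. Because plt.bar draws one bar per app at the same x position, the visible height is the highest rating within each category, not the typical rating.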

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

No such insights were found.
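To compare categories on their typical rating rather than on their best-rated app, the ratings can be aggregated per category first. This is a minimal sketch on made-up values (`toy` is an assumption for illustration) that puts the mean next to the maximum.

```python
import pandas as pd

# Toy frame standing in for ps_ar; the values are made up for illustration.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Rating": [5.0, 3.0, 4.0, 4.5, 3.5],
})

# Mean and maximum rating per category, side by side.
summary = toy.groupby("Category")["Rating"].agg(["mean", "max"])
print(summary)
```

In this toy data both categories have the same mean but different maxima, which is the kind of difference a chart of overlapping bars would hide.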


Chart - 4

# Chart - 4 visualization code
# Bar graph for genres and rating
x=ps_ar.loc[:,'Genres']
y=ps_ar.loc[:,'Rating']
plt.bar(x,y)
plt.title('Genres vs Rating')
plt.xlabel('Genres')
plt.xticks(rotation =90)
plt.ylabel('Rating')
plt.show()





1. Why did you pick the specific chart?

To check whether there is any relationship between genres and rating.

2. What is/are the insight(s) found from the chart?

No significant insights were obtained from the graph.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

N/A

Chart - 5

# Chart - 5 visualization code
#scatter  plot of content rating and rating
plt.scatter(ps_ar['Rating'], ps_ar['Content Rating'])
plt.xlabel('Rating')
plt.ylabel('Content Rating')
plt.title('content rating and rating')
plt.show()

1. Why did you pick the specific chart?

To do bi-variate analysis between content Rating and rating.

2. What is/are the insight(s) found from the chart?

Insights:

Most end users liked content for the 10+ age group, mostly giving it a rating above 3.4.

A wide variation in rating is seen for content made for every age group.

Very few ratings are seen for 18+ or unrated content, which shows that end users have the least interest in it.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

For 18+ content, the ratings show that people take the least interest in it, so making such content is not advisable.

The most liked content is content made with all age groups in mind.

The insights gained from the graph above will have a positive impact on business.

Chart - 6

# Chart -6 visualization code
#bar plot between last updated and category
f, ax = plt.subplots(figsize=(25,5))

x=ps_ar.loc[:,'Last Updated']
y=ps_ar.loc[:,'Category']
plt.bar(x,y)
plt.title('Category vs Last Updated')
plt.xlabel('Last Updated')
plt.xticks(rotation =90)
plt.ylabel('Category')
plt.show()

# sns.swarmplot(data=ps_ar, x='Last Updated', y='Category')
# plt.show()

1. Why did you pick the specific chart?

To do bi-variate analysis between last updated and category.

2. What is/are the insight(s) found from the chart?

It is difficult to find insights here: there are so many data labels on the x-axis that they are not legible, even after changing the size of the graph.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

N/A
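One way to make Last Updated readable is to parse it as a date and count updates per year, which leaves only a handful of x-axis labels. This is a sketch under assumptions: the date format 'January 7, 2018' and the sample values are made up for illustration and should be checked against the actual column.

```python
import pandas as pd

# Made-up sample of Last Updated strings; the format is an assumption.
last_updated = pd.Series(["January 7, 2018", "August 1, 2018", "June 20, 2017"])

# Parse to datetimes; unparseable entries would become NaT with errors='coerce'.
dates = pd.to_datetime(last_updated, errors="coerce")

# Number of apps last updated in each year, in chronological order.
updates_per_year = dates.dt.year.value_counts().sort_index()
print(updates_per_year)
```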

Chart - 7

# Chart - 7 visualization code
#scatter plot of Genres and Installs
plt.scatter(ps_ar['Genres'], ps_ar['Installs'])
plt.xlabel('Genres')
plt.xticks(rotation=90)
plt.ylabel('Installs')
plt.title('Genres vs Installs')
plt.show()

1. Why did you pick the specific chart?

To do a bivariate analysis of genres and installs using a scatter plot.

2. What is/are the insight(s) found from the chart?

No significant relation between genres and installs is seen in the plot, which shows a random distribution of dots.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

N/A

Chart - 8

# Chart - 8 visualization code
# sns.histplot(x='Current Ver', data=ps_ar)
# plt.xticks(rotation =90)
                           #OR

plt.hist(ps_ar['Current Ver'])
plt.xlabel('Current Version')
plt.xticks(rotation = 90)
plt.title('Current Version analysis')
plt.show()



1. Why did you pick the specific chart?

To do a univariate analysis of the current version of apps.

2. What is/are the insight(s) found from the chart?

Most apps have a version higher than 5. A few apps have versions that vary with device. Very few apps are on a version lower than 3.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

Apps with a version lower than 4 should start releasing frequent updates, so that end users benefit and stay up to date.
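Version strings are text, so thresholds such as "lower than 4" are easier to check after extracting the leading number. The sketch below assumes versions look like '5.2' or '10.1.3'; the sample values are made up for illustration and the real column may contain other formats.

```python
import pandas as pd

# Made-up sample of Current Ver strings; the formats are assumptions.
versions = pd.Series(["1.0.0", "5.2", "10.1.3", "Varies with device"])

# Take the leading run of digits as the major version; no match gives NaN.
major = pd.to_numeric(
    versions.str.extract(r"^(\d+)", expand=False), errors="coerce"
)

# Share of apps (with a parseable version) whose major version is below 4.
share_below_4 = (major.dropna() < 4).mean()
print(major.tolist(), share_below_4)
```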

Chart - 9

# Chart - 9 visualization code
#scatter plot between Rating and reviews
sns.scatterplot(data=ps_ar, x='Rating',y='Reviews')
plt.title('Rating vs Review')
plt.show()

1. Why did you pick the specific chart?

To see the trend between Rating and Review.

2. What is/are the insight(s) found from the chart?

Most of the reviews are clustered in the 3.8 to 4.8 rating range.

3. Will the gained insights help creating a positive business impact?

Are there any insights that lead to negative growth? Justify with specific reason.

This analysis of rating and reviews does not provide any useful insights that would lead to a negative or positive impact on business.

Chart - 10 - Correlation Heatmap

# Correlation Heatmap visualization code
sns.heatmap(ps_ar[['Rating','Reviews']].corr(),cmap='vlag', annot = True)

1. Why did you pick the specific chart?

For correlation analysis between rating and reviews, since the scatter plot of rating and reviews gave few useful insights.

2. What is/are the insight(s) found from the chart?

The heatmap shows a positive correlation between rating and reviews.
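The number behind the heatmap can be printed directly, which makes it easier to state how strong the correlation is. This is a sketch on made-up values (`toy`, including the '3.0M' entry, is an assumption for illustration); `corr()` needs both columns to be numeric, so Reviews is converted first.

```python
import pandas as pd

# Toy frame standing in for ps_ar; the values are made up for illustration.
toy = pd.DataFrame({
    "Rating": [4.0, 4.5, 3.0, 5.0],
    "Reviews": ["100", "2000", "50", "3.0M"],
})

# Convert Reviews to numbers; text that cannot be parsed becomes NaN.
toy["Reviews"] = pd.to_numeric(toy["Reviews"], errors="coerce")

# Pearson correlation matrix; rows with NaN are skipped pairwise.
corr = toy[["Rating", "Reviews"]].corr()
print(corr)
```

The off-diagonal entry is the Pearson coefficient between the two columns; values near 0 indicate a weak linear relationship and values near 1 or -1 a strong one.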

Chart - 11 - Pair Plot

# Pair Plot visualization code
sns.pairplot(data= ps_ar)

1. Why did you pick the specific chart?

For multi variate analysis.

2. What is/are the insight(s) found from the chart?

Ratings are clustered in the range of 3.8 to 4.8, which means most users have left a positive review and a good rating.

A few users have given a rating of 5 and a few have given a rating below 2.

5. Solution to Business Objective

What do you suggest the client to achieve Business Objective ?

Explain Briefly.

The following points are suggested to the client to achieve the Business Objective:

  • Provide frequent updates to customers in this fast-changing world.

  • Most end users like News and Magazines, Personalization, Productivity, Travel and Local, Family, Medical, Social, Lifestyle, Finance, Business, and Art and Design applications, so we should focus on these segments.

  • Little scope for scaling up the user base is observed in the Shopping, Photography and Sports segments.

  • Engage customers continuously and collect their feedback to improve the content quality of the app and increase end-user engagement with it.

  • A higher number of user reviews indicates positive user sentiment towards that application.

Conclusion

In this project analysing Google Play Store applications, I mostly focused on finding the relationship between rating and number of installations, and on what rating to expect given the number of reviews and installations.

I followed the data science process of data preparation, data cleansing and data analysis. In data cleansing, I performed a few steps to ensure data quality, such as removing NaN values. With the cleansed data, I performed exploratory data analysis to understand the dataset, for example the number of installations for each category.

From the results, the relationship between genres and installations is very weak, close to no relationship at all. The trends in genres and installations do not appear to depend on each other.

From the results and the process I implemented, I conclude that I achieved this project's objectives: analysing the Google Play Store apps and determining trends in the Google Play Store and the questions we focused on.
